
Someone Made a Dataset of One Million Bluesky Posts for 'Machine Learning Research'

A Hugging Face employee made a huge dataset of Bluesky posts, and it’s already very popular. 

Update: Following the publication of this article on Tuesday evening, van Strien removed the dataset. "I've removed the Bluesky data from the repo," he wrote on Bluesky. "While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake."

A machine learning librarian at Hugging Face released a dataset composed of one million Bluesky posts, complete with when they were posted and who posted them, intended for machine learning research.

Daniel van Strien posted about the dataset on Bluesky on Tuesday:

First dataset for the new @huggingface.bsky.social @bsky.app community organisation: one-million-bluesky-posts 🦋
📊 1M public posts from Bluesky's firehose API
🔍 Includes text, metadata, and language predictions
🔬 Perfect to experiment with using ML for Bluesky 🤗
huggingface.co/datasets/blu...

Daniel van Strien (@danielvanstrien.bsky.social) 2024-11-26T13:50:34.824Z

“This dataset contains 1 million public posts collected from Bluesky Social's firehose API, intended for machine learning research and experimentation with social media data,” the dataset description says. “Each post contains text content, metadata, and information about media attachments and reply relationships.” 
